Abstract

We describe an approach to create a diverse set of predictions with
spectral learning of latent-variable PCFGs (L-PCFGs). Our approach works
by creating multiple spectral models where noise is added to the
underlying features in the training set before the estimation of each
model. We describe three ways to decode with multiple models. In addition,
we describe a simple variant of the spectral algorithm for L-PCFGs that is
fast and leads to compact models. Our experiments for natural language
parsing, for English and German, show that we get a significant
improvement over baselines comparable to state of the art. For English, we
achieve an F1 score of 90.18, and for German we achieve an F1 score of
83.38.

1 Introduction 

It has long been recognized in NLP that a diverse set of solutions from a
decoder can be reranked or recombined in order to improve accuracy in
various problems (Henderson and Brill, 1999). Such problems include
machine translation (Macherey and Och, 2007), syntactic parsing (Charniak
and Johnson, 2005; Sagae and Lavie, 2006; Fossum and Knight, 2009; Zhang
et al., 2009; Petrov, 2010; Choe et al., 2015) and others (Van Halteren
et al., 2001).

The main argument behind the use of such a diverse set of solutions (such
as a k-best list of parses for a natural language sentence) is the hope
that each solution in the set is mostly correct. Therefore, recombination
or reranking of solutions in that set will further optimize the choice of
a solution, combining together the information from all solutions.

In this paper, we explore another angle for the use of a set of parse tree
predictions, where all predictions are made for the same sentence. More
specifically, we describe techniques to exploit diversity with spectral
learning algorithms for natural language parsing. Spectral techniques and
the method of moments have been recently used for various problems in
natural language processing, including parsing, topic modeling and the
derivation of word embeddings (Luque et al., 2012; Cohen et al., 2013;
Stratos et al., 2014; Dhillon et al., 2015; Rastogi et al., 2015; Nguyen
et al., 2015; Lu et al., 2015).

Cohen et al. (2013) showed how to estimate an L-PCFG using spectral
techniques, and showed that such estimation outperforms the
expectation-maximization algorithm (Matsuzaki et al., 2005). Their result
still lags behind state of the art in natural language parsing, with
methods such as coarse-to-fine (Petrov et al., 2006).

We further advance the accuracy of natural language parsing with spectral
techniques and L-PCFGs, yielding a result that outperforms the original
Berkeley parser from Petrov and Klein (2007). Instead of exploiting
diversity from a k-best list from a single model, we estimate multiple
models, where the underlying features are perturbed with several
perturbation schemes. Each such model, during test time, yields a single
parse, and all parses are then used together in several ways to select a
single best parse.

The main contributions of this paper are twofold. First, we present an
algorithm for estimating L-PCFGs, akin to the spectral algorithm of Cohen
et al. (2012), but simpler to understand and implement. This algorithm has
value for readers who are interested in learning more about spectral
algorithms - it demonstrates some of the core ideas in spectral learning
in a rather intuitive way. In addition, this algorithm leads to sparse
grammar estimates and compact models.

Second, we describe how a diverse set of predictors can be used with
spectral learning techniques. Our approach relies on adding noise to the
feature functions that help the spectral algorithm compute the latent
states. Our noise schemes are similar to those described by Wang et al.
(2013). We add noise to the whole training data, then train a model using
our algorithm (or other spectral algorithms; Cohen et al., 2013), and
repeat this process multiple times. We then use the set of parses we get
from all models in a recombination step.

The rest of the paper is organized as follows. In §2 we describe notation
and background about L-PCFG parsing. In §3 we describe our new spectral
algorithm for estimating L-PCFGs. It is based on similar intuitions as
older spectral algorithms for L-PCFGs. In §4 we describe the various noise
schemes we use with our spectral algorithm and the spectral algorithm of
Cohen et al. (2013). In §5 we describe how to decode with multiple models,
each arising from a different noise setting. In §6 we describe our
experiments with natural language parsing for English and German.

2 Background and Notation 


We denote by $[n]$ the set of integers $\{1, \ldots, n\}$. For a statement
$F$, we denote by $[[F]]$ its indicator function, with values 0 when the
assertion is false and 1 when it is true.

An L-PCFG is a 5-tuple $(\mathcal{N}, \mathcal{I}, \mathcal{P}, m, n)$
where:

• $\mathcal{N}$ is the set of nonterminal symbols in the grammar.
  $\mathcal{I} \subset \mathcal{N}$ is a finite set of interminals.
  $\mathcal{P} \subset \mathcal{N}$ is a finite set of preterminals. We
  assume that $\mathcal{N} = \mathcal{I} \cup \mathcal{P}$, and
  $\mathcal{I} \cap \mathcal{P} = \emptyset$. Hence we have partitioned
  the set of nonterminals into two subsets.

• $[m]$ is the set of possible hidden states.

• $[n]$ is the set of possible words.

• For all $a \in \mathcal{I}$, $b \in \mathcal{N}$, $c \in \mathcal{N}$,
  $h_1, h_2, h_3 \in [m]$, we have a binary context-free rule
  $a(h_1) \to b(h_2)\ c(h_3)$.

• For all $a \in \mathcal{P}$, $h \in [m]$, $x \in [n]$, we have a lexical
  context-free rule $a(h) \to x$.

Latent-variable PCFGs are essentially equivalent to probabilistic regular
tree grammars (PRTGs; Knight and Graehl, 2005) where the righthand side
trees are of depth 1. With general PRTGs, the righthand side can be of
arbitrary depth, where the leaf nodes of these trees correspond to latent
states in the L-PCFG formulation above and the internal nodes of these
trees correspond to interminal symbols in the L-PCFG formulation.

Two important concepts that will be used throughout the paper are that of
an “inside tree” and an “outside tree.” Given a tree, the inside tree for
a node contains the entire subtree below that node; the outside tree
contains everything in the tree excluding the inside tree. See Figure 1
for an example. Given a grammar, we denote the space of inside trees by
$T$ and the space of outside trees by $O$.

Figure 1: The inside tree (left) and outside tree (right) for the
nonterminal VP in the parse tree (S (NP (D the) (N dog)) (VP (V saw)
(NP (D the) (N woman)))).
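
The inside/outside decomposition used throughout the paper can be
illustrated with a short sketch. It is not the authors' code: the nested
tuple encoding of trees, the function names `split_at`, `node_paths` and
`all_splits`, and the `"VP*"` foot-marker convention are assumptions made
for this example only.

```python
# Trees are nested tuples: (label, child, ...); leaves are plain strings.
TREE = ("S",
        ("NP", ("D", "the"), ("N", "dog")),
        ("VP", ("V", "saw"), ("NP", ("D", "the"), ("N", "woman"))))


def split_at(tree, path):
    """Return (inside, outside) for the node reached by following `path`.

    `path` is a tuple of 0-based child indices. The outside tree is the
    whole tree with the node replaced by a foot marker "<label>*".
    """
    if not path:
        return tree, tree[0] + "*"
    label, children = tree[0], tree[1:]
    head, rest = path[0], path[1:]
    inside, outside_child = split_at(children[head], rest)
    new_children = children[:head] + (outside_child,) + children[head + 1:]
    return inside, (label,) + new_children


def node_paths(tree, path=()):
    """Yield the path of every nonterminal node (words are skipped)."""
    if isinstance(tree, str):
        return
    yield path
    for index, child in enumerate(tree[1:]):
        yield from node_paths(child, path + (index,))


def all_splits(tree):
    """Yield (label, inside tree, outside tree) for every node in `tree`."""
    for path in node_paths(tree):
        inside, outside = split_at(tree, path)
        yield inside[0], inside, outside


vp_inside, vp_outside = split_at(TREE, (1,))
```

Calling `all_splits` on every tree in a treebank would produce one
(nonterminal, inside tree, outside tree) triple per node, which is the
form of training example the estimation procedure in the next section
consumes.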


3 Clustering Algorithm for Estimating 
L-PCFGs 

We assume two feature functions, $\phi: T \to \mathbb{R}^d$ and
$\psi: O \to \mathbb{R}^{d'}$, mapping inside and outside trees,
respectively, to a real vector. Our training data consist of examples
$(a^{(i)}, t^{(i)}, o^{(i)}, b^{(i)})$ for $i \in \{1 \ldots M\}$, where
$a^{(i)} \in \mathcal{N}$; $t^{(i)}$ is an inside tree; $o^{(i)}$ is an
outside tree; and $b^{(i)} = 1$ if $a^{(i)}$ is the root of the tree, 0
otherwise. These are obtained by splitting all trees in the training set
into inside and outside trees at each node in each tree. We then define
$\Omega^a \in \mathbb{R}^{d \times d'}$:

$$\Omega^a = \frac{\sum_{i=1}^{M} [[a^{(i)} = a]]\, \phi(t^{(i)}) \left(\psi(o^{(i)})\right)^{\top}}{\sum_{i=1}^{M} [[a^{(i)} = a]]} \qquad (1)$$

This matrix is an empirical estimate for the cross-covariance matrix
between the inside trees and the outside trees of a given nonterminal $a$.
An inside tree and an outside tree are conditionally independent according
to the L-PCFG model, when the latent state at their connecting point is
known. This means that the latent state can be identified by finding
patterns that co-occur together in inside and outside trees - it is the
only random variable that can explain such correlations. As such, if we
reduce the dimensions of $\Omega^a$ using singular value decomposition
(SVD), we essentially get


Inputs: An input treebank with the following additional information:
training examples $(a^{(i)}, t^{(i)}, o^{(i)}, b^{(i)})$ for
$i \in \{1 \ldots M\}$, where $a^{(i)} \in \mathcal{N}$; $t^{(i)}$ is an
inside tree; $o^{(i)}$ is an outside tree; and $b^{(i)} = 1$ if the rule
is at the root of the tree, 0 otherwise. A function $\phi$ that maps
inside trees $t$ to feature vectors $\phi(t) \in \mathbb{R}^d$. A function
$\psi$ that maps outside trees $o$ to feature vectors
$\psi(o) \in \mathbb{R}^{d'}$. An integer $k$ denoting the thin-SVD rank.
An integer $m$ denoting the number of latent states.

Algorithm:

(Step 0: Singular Value Decompositions)

• Calculate SVD on $\Omega^a$ to get $U^a \in \mathbb{R}^{d \times k}$ and
  $V^a \in \mathbb{R}^{d' \times k}$ for each $a \in \mathcal{N}$.

(Step 1: Projection)

• For all $i \in [M]$, compute
  $y^{(i)} = (U^{a^{(i)}})^{\top} \phi(t^{(i)})$ and
  $z^{(i)} = (V^{a^{(i)}})^{\top} \psi(o^{(i)})$.

• For all $i \in [M]$, set $x^{(i)}$ to be the concatenation of $y^{(i)}$
  and $z^{(i)}$.

(Step 2: Cluster Projections)

• For all $a \in \mathcal{N}$, cluster the set
  $\{x^{(i)} \mid a^{(i)} = a\}$ to get a clustering function
  $\gamma: \mathbb{R}^{2k} \to [m]$ that maps a projected vector $x^{(i)}$
  to a cluster in $[m]$.

(Step 3: Compute Final Parameters)

• Annotate each node in the treebank with $\gamma(x^{(i)})$.

• Compute the probability of a rule
  $p(a[h_1] \to b[h_2]\ c[h_3] \mid a[h_1])$ as the relative frequency of
  its appearance in the cluster-annotated treebank.

• Similarly, compute the root probabilities $\pi(a[h])$ and preterminal
  rules $p(a[h] \to x \mid a[h])$.


Figure 2: The clustering estimation algorithm for L-PCFGs.

representations for the inside trees and the outside trees that correspond
to the latent states.
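
As an illustration of Eq. (1), the sketch below accumulates one empirical
cross-covariance matrix per nonterminal from (nonterminal, inside
features, outside features) triples. It assumes that the estimate is the
average of outer products over the examples of each nonterminal; the
function name `cross_covariance` and the list-of-lists matrix layout are
choices made for this example, and a practical implementation would use
sparse matrices and a thin-SVD routine.

```python
from collections import defaultdict


def cross_covariance(examples, d, d_prime):
    """Average the outer products phi(t) psi(o)^T separately per nonterminal.

    `examples` is an iterable of (a, phi, psi), where phi has length d and
    psi has length d_prime. Returns {a: d x d_prime matrix as nested lists}.
    """
    sums = {}
    counts = defaultdict(int)
    for a, phi, psi in examples:
        if a not in sums:
            sums[a] = [[0.0] * d_prime for _ in range(d)]
        matrix = sums[a]
        for row, phi_value in enumerate(phi):
            if phi_value == 0:
                continue  # features are sparse, so most rows are untouched
            for col, psi_value in enumerate(psi):
                matrix[row][col] += phi_value * psi_value
        counts[a] += 1
    return {a: [[value / counts[a] for value in row] for row in matrix]
            for a, matrix in sums.items()}


# Toy data: two VP examples and one NP example with indicator features.
toy_examples = [
    ("VP", [1, 0], [0, 1, 0]),
    ("VP", [1, 0], [1, 0, 0]),
    ("NP", [0, 1], [0, 0, 1]),
]
omega = cross_covariance(toy_examples, d=2, d_prime=3)
```

Each resulting matrix would then be passed to a rank-$k$ SVD to obtain the
projections used in the rest of the algorithm.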

This intuition leads to the algorithm that appears in Figure 2. The
algorithm we describe takes as input training data, in the form of a
treebank, decomposed into inside and outside trees at each node in each
tree in the training set.

The algorithm first performs SVD on the set of inside and outside trees
for each nonterminal.^1 This step is akin to CCA, which has been used in
various contexts in NLP, mostly to derive representations for words
(Dhillon et al., 2015; Rastogi et al., 2015). The algorithm then takes the
representations induced by the SVD step, and clusters them - we use
k-means to do the clustering. Finally, it maps each SVD representation to
a cluster, and as a result, gets a cluster identifier for each node in
each tree in the training data. These clusters are now treated as latent
states that are “observed.” We subsequently follow up with frequency-count
maximum likelihood estimation to estimate the probability of each
parameter in the L-PCFG.

^1 We normalize features by their variance.

Consider for example the estimation of rules of the form $a \to x$.
Following the clustering step we obtain for each nonterminal $a$ and
latent state $h$ a set of rules of the form $a[h] \to x$. Each such
instance comes from a single training example of a lexical rule. Next, we
compute the probability of the rule $a[h] \to x$ by counting how many
times that rule appears in the training instances, and normalize by the
total count of $a[h]$ in the training instances. Similarly, we compute
probabilities for binary rules of the form $a \to b\ c$.
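
The relative-frequency step can be sketched in a few lines. The encoding
below is an assumption made for illustration: a left-hand side is a
(nonterminal, cluster) pair, and a right-hand side is either a word or a
pair of such annotated symbols.

```python
from collections import Counter


def estimate_rule_probs(annotated_rules):
    """Relative-frequency estimates p(rhs | lhs) from cluster-annotated rules."""
    rules = list(annotated_rules)
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}


# Rule instances read off a (hypothetical) cluster-annotated treebank.
toy_rules = [
    (("N", 1), "dog"),
    (("N", 1), "dog"),
    (("N", 1), "woman"),
    (("N", 2), "dog"),
    (("VP", 3), (("V", 1), ("NP", 2))),
]
probs = estimate_rule_probs(toy_rules)
```

Because every cluster-annotated rule that never occurs in the treebank
simply has no entry, estimates of this kind are naturally sparse.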

The features that we use for $\phi$ and $\psi$ are similar to those used
in Cohen et al. (2013). These features look at the local neighborhood
surrounding a given node. More specifically, we indicate the following
information with the inside features (throughout these definitions assume
that $a \to b\ c$ is at the root of the inside tree $t$):

• The pair of nonterminals $(a, b)$. E.g., for the inside tree in Figure 1
  this would be the pair (VP, V).

• The pair $(a, c)$. E.g., (VP, NP).

• The rule $a \to b\ c$. E.g., VP → V NP.

• The rule $a \to b\ c$ paired with the rule at the node $b$. E.g., for
  the inside tree in Figure 1 this would correspond to the tree fragment
  (VP (V saw) NP).

• The rule $a \to b\ c$ paired with the rule at the node $c$. E.g., the
  tree fragment (VP V (NP D N)).

• The head part-of-speech of $t$ paired with $a$. E.g., the pair (VP, V).

• The number of words dominated by $t$ paired with $a$. E.g., the pair
  (VP, 3).

In the case of an inside tree consisting of a single rule $a \to x$ the
feature vector simply indicates the identity of that rule.

For the outside features, we use:

• The rule above the foot node. E.g., for the outside tree in Figure 1
  this would be the rule S → NP VP* (the foot nonterminal is marked
  with *).

• The two-level and three-level rule fragments above the foot node. These
  features are absent in the outside tree in Figure 1.

• The label of the foot node, together with the label of its parent. E.g.,
  the pair (VP, S).

• The label of the foot node, together with the label of its parent and
  grandparent.

• The part-of-speech of the first head word along the path from the foot
  of the outside tree to the root of the tree which is different from the
  head node of the foot node.

• The width of the spans to the left and to the right of the foot node,
  paired with the label of the foot node.

Other Spectral Algorithms The SVD step on the $\Omega^a$ matrix is pivotal
to many algorithms, and has been used in the past for other L-PCFG
estimation algorithms. Cohen et al. (2012) used it for developing a
spectral algorithm that identifies the parameters of the L-PCFG up to a
linear transformation. Their algorithm generalizes the work of Hsu et al.
(2009) and Bailly et al. (2010).

Cohen and Collins (2014) also developed an algorithm that makes use of an
SVD step on the inside-outside matrix $\Omega^a$. It relies on the idea of
“pivot features” - features that uniquely identify latent states.

Louis and Cohen (2015) used a clustering algorithm that resembles ours but
does not separate inside trees from outside trees or follow up with a
singular value decomposition step. Their algorithm was applied to both
L-PCFGs and linear context-free rewriting systems. Their application was
the analysis of the hierarchical structure of conversations in online
forums.

In our preliminary experiments, we found that the clustering algorithm by
itself performs worse than the spectral algorithm of Cohen et al. (2013).
We believe that the reason is twofold: (a) k-means finds a local maximum
during clustering; (b) we do hard clustering instead of soft clustering.
However, we detected that the clustering algorithm gives a more diverse
set of solutions when the features are perturbed. As such, in the next
sections, we explain how to perturb the models we get from the clustering
algorithm (and the spectral algorithm) in order to improve the accuracy of
the clustering and spectral algorithms.


4 Spectral Estimation with Noise 

It has been shown that a diverse set of predictions can be used to help
improve decoder accuracy for various problems in NLP (Henderson and Brill,
1999). Usually a k-best list from a single model is used to exploit model
diversity. Instead, we estimate multiple models, where the underlying
features are filtered with various noising schemes.

We try three different types of noise schemes for the algorithm in
Figure 2:

Dropout noise: Let $\sigma \in [0, 1]$. We set each element in the feature
    vectors $\phi(t)$ and $\psi(o)$ to 0 with probability $\sigma$.

Gaussian (additive): Let $\sigma > 0$. For each $x^{(i)}$, we draw a
    vector $\varepsilon \in \mathbb{R}^{2k}$ of Gaussians with mean 0 and
    variance $\sigma^2$, and then set
    $x^{(i)} \leftarrow x^{(i)} + \varepsilon$.

Gaussian (multiplicative): Let $\sigma > 0$. For each $x^{(i)}$, we draw a
    vector $\varepsilon \in \mathbb{R}^{2k}$ of Gaussians with mean 0 and
    variance $\sigma^2$, and then set
    $x^{(i)} \leftarrow x^{(i)} \otimes (1 + \varepsilon)$, where
    $\otimes$ is coordinate-wise multiplication.

Note the distinction between the dropout noise and the Gaussian noise
schemes: the first is performed on the feature vectors before the SVD
step, and the second is performed after the SVD step. It is not feasible
to add Gaussian noise prior to the SVD step, since the matrix $\Omega^a$
will no longer be sparse, and its SVD computation will be computationally
demanding.
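
The three perturbations can be sketched as follows. This is an
illustration only: vectors are plain Python lists, the function names are
invented for this example, and the multiplicative scheme is assumed to
scale each coordinate by one plus a zero-mean Gaussian draw.

```python
import random


def dropout(vec, sigma, rng):
    """Zero each coordinate independently with probability sigma."""
    return [0.0 if rng.random() < sigma else value for value in vec]


def additive_gaussian(vec, sigma, rng):
    """Add N(0, sigma^2) noise to each coordinate (sigma is the std. dev.)."""
    return [value + rng.gauss(0.0, sigma) for value in vec]


def multiplicative_gaussian(vec, sigma, rng):
    """Scale each coordinate by (1 + eps), with eps ~ N(0, sigma^2)."""
    return [value * (1.0 + rng.gauss(0.0, sigma)) for value in vec]


rng = random.Random(0)            # fixed seed so that a run is repeatable
features = [1.0, 0.0, 1.0, 1.0]   # a sparse feature vector (before SVD)
projection = [0.3, -1.2, 0.7]     # a dense projected vector (after SVD)

noisy_features = dropout(features, 0.1, rng)
noisy_additive = additive_gaussian(projection, 0.1, rng)
noisy_multiplicative = multiplicative_gaussian(projection, 0.1, rng)
```

Dropout would be applied to the sparse feature vectors before the SVD
step, and the two Gaussian schemes to the dense projections after it;
training one model per noise draw then yields the set of models that is
later combined.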

Our use of dropout noise here is inspired by “dropout” as is used in
neural network training, where various connections between units in the
neural network are dropped during training in order to avoid overfitting
of these units to the data (Srivastava et al., 2014).

The three schemes we described were also used by Wang et al. (2013) to
train log-linear models. Wang et al.’s goal was to prevent overfitting by
introducing these noise schemes as additional regularizer terms, but
without explicitly changing the training data. We do filter the data
through these noise schemes, and show in §6 that none of these noise
schemes improves the performance of our estimation on its own. However,
when multiple models are created with these noise schemes, and then
combined together, we get improved performance. As such, our approach is
related to the one of Petrov (2010), who builds a committee of
latent-variable PCFGs in order to improve a natural language parser.

We also use these perturbation schemes to create multiple models for the
algorithm of Cohen et al. (2012). The dropout scheme stays the same, but
for the Gaussian noising schemes, we follow a slightly different
procedure. After noising the projections of the inside and outside feature
functions we get from the SVD step, we use these projected noised features
as a new set of inside and outside feature functions, and re-run the
spectral algorithm of Cohen et al. (2012) on them.

We are required to add this extra SVD step because the spectral algorithm
of Cohen et al. assumes the existence of linearly transformed parameter
estimates, where the parameters of each nonterminal $a$ are linearly
transformed by unknown invertible matrices. These matrices cancel out when
the inside-outside algorithm is run with the spectral estimate output. In
order to ensure that these matrices still exactly cancel out, we have to
follow with another SVD step as described above. The latter SVD step is
performed on a dense $\Omega^a \in \mathbb{R}^{m \times m}$, but this is
not an issue considering that $m$ (the number of latent states) is much
smaller than $d$ or $d'$.

5 Decoding with Multiple Models 

Let $G_1, \ldots, G_p$ be a set of L-PCFG grammars. In §6, we create such
models using the noising techniques described above. The question that
remains is how to combine these models together to get a single best
output parse tree given an input sentence.

With L-PCFGs, decoding a single sentence requires marginalizing out the
latent states to find the best skeletal tree^2 for a given string. Let $s$
be a sentence. We define $t(G_i, s)$ to be the output tree according to
minimum Bayes risk decoding. This means we follow Goodman (1996), who uses
dynamic programming to compute the tree that maximizes the sum of all
marginals of all nonterminals in the output tree. Each marginal, for each
span $(a, i, j)$ (where $a$ is a nonterminal and $i$ and $j$ are endpoints
in the sentence), is computed by using the inside-outside algorithm.

In addition, let $\mu(a, i, j \mid G_k, s)$ be the marginal, as computed
by the inside-outside algorithm, for the span $(a, i, j)$ with grammar
$G_k$ for string $s$. We use the notation $(a, i, j) \in t$ to denote that
a span $(a, i, j)$ is in a tree $t$.

^2 A skeletal tree is a derivation tree without latent states decorating
the nonterminals.

We suggest the following three ways for decoding with multiple models
$G_1, \ldots, G_p$:

Maximal tree coverage: Using dynamic programming, we return the tree that
    is the solution to:

    $$t^* = \arg\max_{t} \sum_{(a,i,j) \in t} \sum_{k=1}^{p} [[(a, i, j) \in t(G_k, s)]].$$

    This implies that we find the tree that maximizes its coverage with
    respect to all other trees that are decoded using $G_1, \ldots, G_p$.

Maximal marginal coverage: Using dynamic programming, we return the tree
    that is the solution to:

    $$t^* = \arg\max_{t} \sum_{(a,i,j) \in t} \sum_{k=1}^{p} \mu(a, i, j \mid G_k, s).$$

    This is similar to maximal tree coverage, only instead of considering
    just the single decoded tree for each model among $G_1, \ldots, G_p$,
    we make our decoding “softer,” and rely on the marginals that each
    model gives.

MaxEnt reranking: We train a MaxEnt reranker on a training set that
    includes outputs from multiple models, and then, during testing time,
    decode with each of the models, and use the trained reranker to select
    one of the parses. We use the reranker of Charniak and Johnson
    (2005).^3

As we see later in §6, it is sometimes possible to extract more
information from the training data by using a network, or a hierarchy, of
the above tree combination methods. For example, we get our best result
for parsing by first using MaxEnt with several subsets of the models, and
then combining the output of these MaxEnt models using maximal tree
coverage.

^3 Implementation: https://github.com/BLLIP/bllip-parser. More
specifically, we used the programs extract-spfeatures, cvlm-lbfgs and
best-indices. cvlm-lbfgs was used with the default hyperparameters from
the Makefile.
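
The two coverage objectives differ only in how a span is scored, so both
can be illustrated with one small dynamic program. The sketch below is not
the decoder used in the experiments: it assumes fencepost span indices
$(a, i, j)$ with $0 \le i < j \le n$, allows every binary bracketing rather
than only those licensed by a grammar, and lets each chosen span take its
best-scoring label. All function names are illustrative.

```python
from collections import Counter
from functools import lru_cache


def vote_scores(span_sets):
    """Tree coverage: a span scores one point per model output containing it."""
    votes = Counter()
    for spans in span_sets:
        votes.update(set(spans))
    return votes


def marginal_scores(marginal_tables):
    """Marginal coverage: a span scores the sum of its marginals over models."""
    total = Counter()
    for table in marginal_tables:
        for span, mu in table.items():
            total[span] += mu
    return total


def best_tree(scores, n):
    """Return (score, labeled spans) of the best binary bracketing of n words."""
    best_label = {}
    for (label, i, j), value in scores.items():
        if value > best_label.get((i, j), (None, 0))[1]:
            best_label[(i, j)] = (label, value)

    @lru_cache(maxsize=None)
    def best(i, j):
        label, score = best_label.get((i, j), (None, 0))
        here = ((label, i, j),) if label is not None else ()
        if j - i == 1:
            return score, here
        options = []
        for k in range(i + 1, j):  # try every split point
            left_score, left_spans = best(i, k)
            right_score, right_spans = best(k, j)
            options.append((left_score + right_score, left_spans + right_spans))
        sub_score, sub_spans = max(options, key=lambda option: option[0])
        return score + sub_score, here + sub_spans

    total, spans = best(0, n)
    return total, set(spans)


# Three hypothetical model outputs for "the dog saw the woman" (n = 5).
majority = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
outlier = {("S", 0, 5), ("NP", 0, 3), ("VP", 3, 5)}
coverage, combined = best_tree(vote_scores([majority, majority, outlier]), 5)
```

Passing `marginal_scores(...)` instead of `vote_scores(...)` to
`best_tree` gives the "softer" marginal-coverage variant under the same
simplifying assumptions.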





          Clustering                Spectral (smoothing)        Spectral (no smoothing)
          MaxTre  MaxMrg  MaxEnt    MaxTre  MaxMrg  MaxEnt      MaxTre  MaxMrg  MaxEnt
Add       88.68   88.64   89.50     88.20   88.28   88.59       86.72   86.85   87.94
Mul       88.74   88.66   89.89     88.48   88.70   89.46       86.97   86.53   89.04
Dropout   88.68   88.56   89.80     88.64   88.71   89.47*      88.37   88.06   89.52
All       88.84   88.75   89.95*    88.38   88.75   89.45       87.49   87.00   89.85*
No noise  86.48                     88.53 (Cohen et al., 2013)  86.47 (Cohen et al., 2013)

Table 1: Results on section 22 (WSJ). MaxTre denotes decoding using
maximal tree coverage, MaxMrg denotes decoding using maximal marginal
coverage, and MaxEnt denotes the use of a discriminative reranker. Add,
Mul and Dropout denote the use of additive Gaussian noise, multiplicative
Gaussian noise and dropout noise, respectively. The number of models used
in the first three rows for the clustering algorithm is 80: 20 for each
$\sigma \in \{0.05, 0.1, 0.15, 0.2\}$. For the spectral algorithm, it is
20, 5 for each $\sigma$ (see footnotes). The number of latent states is
$m = 24$. For All, we use all models combined from the first three rows.
The “No noise” baseline for the spectral algorithm is taken from Cohen et
al. (2013). The best figure in each algorithm block is marked with *.


6 Experiments 

In this section, we describe parsing experiments 
with two languages: English and German. 

6.1 Results for English 

For our English parsing experiments, we use a standard setup. More
specifically, we use the Penn WSJ treebank (Marcus et al., 1993) for our
experiments, with sections 2-21 as the training data, and section 22 used
as the development data. Section 23 is used as the final test set. We
binarize the trees in training data, but transform them back before
evaluating them.

For efficiency, we use a base PCFG without latent states to prune
marginals which receive a value less than 0.00005 in the dynamic
programming chart. The parser takes part-of-speech tagged sentences as
input. We tag all datasets using Turbo Tagger (Martins et al., 2010),
trained on sections 2-21. We use the F1 measure according to the PARSEVAL
metric (Black et al., 1991) for the evaluation.

Preliminary experiments We first experiment with the number of latent
states for the clustering algorithm without perturbations. We use
$k = 100$ for the SVD step. Whenever we need to cluster a set of points,
we run the k-means algorithm 10 times with random restarts and choose the
clustering result with the lowest objective value. On section 22, the
clustering algorithm achieves the following results (F1 measure): m = 8:
84.30%, m = 16: 85.98%, m = 24: 86.48%, m = 32: 85.84%, m = 36: 86.05%,
m = 40: 85.43%. As we increase the number of states, performance improves,
but plateaus at m = 24. For the rest of our experiments, both with the
spectral algorithm of Cohen et al. (2012) and the clustering algorithm
presented in this paper, we use m = 24.
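
The restart strategy for k-means can be sketched in plain Python. This is
a toy implementation for illustration, not the one used in the
experiments: initialization by sampling data points, the iteration cap and
the function names are all assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_points(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def kmeans_once(points, m, rng, iters=50):
    """One k-means run from a random start; returns (objective, centers, assignment)."""
    centers = [list(p) for p in rng.sample(points, m)]
    for _ in range(iters):
        assignment = assign_points(points, centers)
        new_centers = []
        for c in range(m):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    assignment = assign_points(points, centers)
    objective = sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment))
    return objective, centers, assignment


def kmeans_restarts(points, m, restarts=10, seed=0):
    """Run k-means `restarts` times and keep the run with the lowest objective."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, m, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])


toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
objective, centers, assignment = kmeans_restarts(toy_points, m=2)
```

The assignment returned by the best run plays the role of the clustering
function that labels each node with a latent state.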

Figure 3: F1 scores of noisy models. Each data point gives the F1 accuracy
of a single model on the development set, based on the legend. The x-axis
enumerates the models (80 in total for each noise scheme).

Compact models One of the advantages of the clustering algorithm is that
it leads to much more compact models. The number of nonzero parameters
with m = 24 for the clustering algorithm is approximately 97K, while the
spectral algorithms lead to a significantly larger number of nonzero
parameters with the same number of latent states: approximately 54
million.

Oracle experiments To what extent do we get a diverse set of solutions
from the different models we estimate? This question can be answered by
testing the oracle accuracy in the different settings. For each type of
noising scheme, we generated 80 models, 20 for each
$\sigma \in \{0.05, 0.1, 0.15, 0.2\}$. Each noisy model by itself lags
behind the best model (see Figure 3). However, when choosing the best tree
among these models, the additively-noised models get an oracle accuracy of
95.91% on section 22; the multiplicatively-noised models get an oracle
accuracy of 95.81%; and the dropout-noised models get an oracle accuracy
of 96.03%. Finally, all models combined get an oracle accuracy of 96.67%.
We found that these oracle scores are comparable to the one Charniak and
Johnson (2005) report.

      Method                  F1
Best  Spectral (unsmoothed)   89.21
      Spectral (smoothed)     88.87
      Clustering              89.25
Hier  Spectral (unsmoothed)   89.09
      Spectral (smoothed)     89.06
      Clustering              90.18

Table 2: Results on section 23 (English). The first three results (Best)
are taken with the best model in each corresponding block in Table 1. The
last three results (Hier) use a hierarchy of the above tree combination
methods in each block. It combines all MaxEnt results using the maximal
tree coverage (see text).
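
An oracle score of this kind can be computed by picking, for every
sentence, the candidate parse closest to the gold tree. The sketch below
is a simplified sentence-level version: parses are sets of labeled spans
and F1 is computed per sentence, whereas the standard PARSEVAL tooling
aggregates bracket counts over the whole corpus and applies further
conventions that are not modeled here. The function names are
illustrative.

```python
def span_f1(predicted, gold):
    """Harmonic mean of labeled-span precision and recall for one sentence."""
    matched = len(predicted & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


def oracle_parse(candidates, gold):
    """Return the candidate with the highest F1 against the gold spans."""
    return max(candidates, key=lambda candidate: span_f1(candidate, gold))


def oracle_corpus_f1(candidate_lists, gold_trees):
    """Corpus-level F1 obtained when the oracle picks one parse per sentence."""
    matched = predicted_total = gold_total = 0
    for candidates, gold in zip(candidate_lists, gold_trees):
        choice = oracle_parse(candidates, gold)
        matched += len(choice & gold)
        predicted_total += len(choice)
        gold_total += len(gold)
    if matched == 0:
        return 0.0
    precision = matched / predicted_total
    recall = matched / gold_total
    return 2 * precision * recall / (precision + recall)


gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
wrong = {("S", 0, 5), ("NP", 0, 3), ("VP", 3, 5)}
chosen = oracle_parse([wrong, gold], gold)
```

The larger the gap between the oracle score and the score of any single
model, the more there is for a recombination or reranking step to gain.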

We also tested our oracle results, comparing the spectral algorithm of
Cohen et al. (2013) to the clustering algorithm. We generated 20 models
for each type of noising scheme, 5 for each
$\sigma \in \{0.05, 0.1, 0.15, 0.2\}$, for the spectral algorithm.^4
Surprisingly, even though the spectral models were smoothed, their oracle
accuracy was lower than the accuracy of the clustering algorithm: 92.81%
vs. 95.73%.^5 This reinforces two ideas: (i) that noising acts as a
regularizer, and has a similar role to backoff smoothing, as we see below;
and (ii) the noisy estimation for the clustering algorithm produces a more
diverse set of parses than that produced with the spectral algorithm.

It is also important to note that the high oracle accuracy is not just the
result of k-means not finding the global maximum for the clustering
objective. If we just run the clustering algorithm with 80 models as
before, without perturbing the features, the oracle accuracy is 95.82%,
which is lower than the oracle accuracy with the additive and dropout
perturbed models. To add to this, we see below that perturbing the
training set with the spectral algorithm of Cohen et al. improves the
accuracy of the spectral algorithm. Since the spectral algorithm of Cohen
et al. does not maximize any objective locally, it shows that the role of
the perturbations we use is important.

^4 There are two reasons we use a smaller number of models with the
spectral algorithm: (a) models are not compact (see text) and (b) as such,
parsing takes comparatively longer. However, in the above comparison, we
use 20 models for the clustering algorithm as well.

^5 Oracle scores for the clustering algorithm: 95.73% (20 models for each
noising scheme) and 96.67% (80 models for each noising scheme).

Results Results on the development set are given in Table 1 with our three
decoding methods. We present the results from three algorithms: the
clustering algorithm and the spectral algorithms (smoothed and
unsmoothed).^6

It seems that dropout noise for the spectral algorithm acts as a
regularizer, similarly to the backoff smoothing techniques that are used
in Cohen et al. (2013). This is evident from the two spectral algorithm
blocks in Table 1, where dropout noise does not substantially improve the
smoothed spectral model (Cohen et al. report accuracy of 88.53% with the
smoothed spectral model for m = 24 without noise) - the accuracy is
88.64%-88.71%-89.47%, but the accuracy substantially improves for the
unsmoothed spectral model, where dropout brings an accuracy of 86.47% up
to 89.52%.

All three blocks in Table 1 demonstrate that decoding with the MaxEnt
reranker performs the best. Also it is interesting to note that our
results continue to improve when combining the output of previous
combination steps further. The best result on section 22 is achieved when
we combine, using maximal tree coverage, all MaxEnt outputs of the
clustering algorithm (the first block in Table 1). This yields a 90.68% F1
accuracy. This is also the best result we get on the test set (section
23), 90.18%. See Table 2 for results on section 23.

Our results are comparable to state-of-the-art results for parsing. For
example, Sagae and Lavie (2006), Fossum and Knight (2009) and Zhang et al.
(2009) report an accuracy of 93.2%-93.3% using parsing recombination;
Shindo et al. (2012) report an accuracy of 92.4 F1 using a Bayesian tree
substitution grammar; Petrov (2010) reports an accuracy of 92.0% using
product of L-PCFGs; Charniak and Johnson (2005) report accuracy of 91.4
using a discriminative reranking model; Carreras et al. (2008) report 91.1
F1 accuracy for a discriminative, perceptron-trained model; Petrov and
Klein (2007) report an accuracy of 90.1 F1. Collins (2003) reports an
accuracy of 88.2 F1.

^6 Cohen et al. (2013) propose two variants of spectral estimation for
L-PCFGs: smoothed and unsmoothed. The smoothed model uses a simple
backed-off smoothing method which leads to significant improvements over
the unsmoothed one. Here we compare our clustering algorithm against both
of these models. However, unless specified otherwise, the spectral
algorithm of Cohen et al. (2013) refers to their best model, i.e. the
smoothed model.

          Clustering                Spectral (smoothing)      Spectral (no smoothing)
          MaxTre  MaxMrg  MaxEnt    MaxTre  MaxMrg  MaxEnt    MaxTre  MaxMrg  MaxEnt
Add       77.34   76.87   80.01     77.76   77.85   78.09     77.44   77.56   77.91
Mul       77.80   77.80   80.34     77.80   77.76   78.89     77.62   77.85   78.94
Dropout   77.37   77.17   80.94*    77.94   78.06   79.02     77.97   78.17   79.18
All       77.71   77.51   80.86     78.04   77.89   79.46*    77.73   77.91   79.66*
No noise  75.04                     77.71                     77.07

Table 3: Results on the development set for German. See Table 1 for
interpretation of MaxTre, MaxMrg, MaxEnt and Add, Mul and Dropout. The
number of models used in the first three rows for the clustering algorithm
is 80: 20 for each $\sigma \in \{0.05, 0.1, 0.15, 0.2\}$. For the spectral
algorithm, it is 20, 5 for each $\sigma$. The number of latent states is
$m = 8$. For All, we use all models combined from the first three rows.
The best figure in each algorithm block is marked with *.

6.2 Results for German 

For the German experiments, we used the NEGRA corpus (Skut et al., 1997).
We use the same setup as in Petrov (2010), and use the first 18,602
sentences as a training set, the next 1,000 sentences as a development set
and the last 1,000 sentences as a test set. This corresponds to an
80%-10%-10% split of the treebank.

Our German experiments follow the same setting as in our English
experiments. For the clustering algorithm we generated 80 models, 20 for
each $\sigma \in \{0.05, 0.1, 0.15, 0.2\}$. For the spectral algorithm, we
generate 20 models, 5 for each $\sigma$.

For the reranking experiment, we had to modify the BLLIP parser (Charniak
and Johnson, 2005) to use the head features from the German treebank. We
based our modifications on the documentation for the NEGRA corpus (our
modifications are based mostly on mapping of nonterminals to coarse
syntactic categories).

Preliminary experiments For German, we also experiment with the number of
latent states. On the development set, we observe that the F1 measure is:
75.04% for m = 8, 73.44% for m = 16 and 70.84% for m = 24. For the rest of
our experiments, we fix the number of latent states at m = 8.

        Method                   F1 

Best    Spectral (unsmoothed)    80.88 
        Spectral (smoothed)      80.31 
        Clustering               81.94 

Hier    Spectral (unsmoothed)    80.64 
        Spectral (smoothed)      79.96 
        Clustering               83.38 

Table 4: Results on the test set for the German 
data. The first three results (Best) are taken with 
the best model in each corresponding block in Table 3. 
The last three results (Hier) use a hierarchy 
of the above tree combination methods. 

Oracle experiments The additively-noised 
models get an oracle accuracy of 90.58% on 
the development set; the multiplicatively-noised 
models get an oracle accuracy of 90.47%; and 
the dropout-noised models get an oracle accuracy 
of 90.69%. Finally, all models combined get an 
oracle accuracy of 92.38%. 

We compared our oracle results to those given 
by the spectral algorithm of Cohen et al. (2013). 
With 20 models for each type of noising scheme, 
all spectral models combined achieve an oracle 
accuracy of 83.45%. The clustering algorithm gets 
the oracle score of 90.12% when using the same 
number of models. 
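
To make the notion of oracle accuracy concrete, the following is a minimal sketch of how such a score can be computed. It assumes that each tree is represented as a set of labeled brackets (label, start, end), that the oracle picks, for each sentence, the candidate with the highest sentence-level F1 against the gold tree, and that the reported figure is the corpus-level F1 of the selected trees. These choices and all names are illustrative assumptions; using sets also ignores repeated brackets, which a full evaluation tool would count.

```python
def bracket_counts(gold, pred):
    """Return (matched, gold_total, pred_total) for two bracket sets."""
    return len(gold & pred), len(gold), len(pred)

def f1(matched, gold_total, pred_total):
    """Harmonic mean of bracket precision and recall (0.0 if nothing matches)."""
    if matched == 0:
        return 0.0
    precision = matched / pred_total
    recall = matched / gold_total
    return 2 * precision * recall / (precision + recall)

def oracle_f1(gold_trees, candidate_lists):
    """Corpus-level F1 when the best candidate is chosen for each sentence."""
    totals = [0, 0, 0]
    for gold, candidates in zip(gold_trees, candidate_lists):
        # Oracle choice: the candidate closest to the gold tree.
        best = max(candidates, key=lambda c: f1(*bracket_counts(gold, c)))
        for i, n in enumerate(bracket_counts(gold, best)):
            totals[i] += n
    return f1(*totals)

# Toy example: one sentence, two candidate parses from different models.
gold = [{("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)}]
candidates = [[
    {("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)},  # only the S bracket is correct
    {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)},  # all brackets are correct
]]
print(oracle_f1(gold, candidates))
```

The oracle score is an upper bound on what any selection-based decoder can reach with a given set of models, which is why it grows as more diverse models are pooled.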

Results Results on the development set and on 
the test set are given in Table 3 and Table 4 
respectively. 

As with English, in all three blocks in Table 3, 
decoding with MaxEnt reranking performs 
best. Our results continue to improve when further 
combining the output of previous combination 
steps. The best result of 82.04% on the development 
set is achieved when we combine, using 
maximal tree coverage, all MaxEnt outputs of the 
clustering algorithm (the first block in Table 3). 
This also leads to the best result of 83.38% on the 
test set (see Table 4). 

Our results are comparable to state-of-the-art 
results for German parsing. For example, Petrov 
(2010) reports an accuracy of 84.5% using a product 
of L-PCFGs; Petrov and Klein (2007) report 
an accuracy of 80.1 F1, and Dubey (2005) reports 
an accuracy of 76.3 F1. 

7 Discussion 

From a theoretical point of view, one of the 
great advantages of spectral learning techniques 
for latent-variable models is that they yield consistent 
parameter estimates. Our clustering algorithm 
for L-PCFG estimation breaks this property, but there is a 
work-around to obtain an algorithm which would 
be statistically consistent. 

The main reason that our algorithm is not a consistent 
estimator is that it relies on k-means clustering, 
which maximizes a non-convex objective 
using hard clustering steps. The k-means algorithm 
can be viewed as “hard EM” for a Gaussian 
mixture model (GMM), where each latent state is 
associated with one of the mixture components in 
the GMM. This means that instead of following up 
with k-means, we could have identified the parameters 
and the posteriors for a GMM, where the observations 
correspond to the vectors that we cluster. 
There are now algorithms, some of which are 
spectral, that aim to solve this estimation problem 
with theoretical guarantees (Vempala and Wang, 
2004; Kannan et al., 2005; Moitra and Valiant, 
2010). 
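
The “hard EM” view of k-means can be illustrated with a minimal sketch. The code below runs Lloyd's algorithm on one-dimensional points and labels the two steps as hard E-step and M-step; the correspondence assumes a GMM with equal mixing weights and a shared spherical covariance. The function name and the toy data are illustrative assumptions, and the vectors clustered in practice are high-dimensional.

```python
def kmeans_hard_em(points, initial_centers, iterations=10):
    """Lloyd's algorithm on 1-D points, written as hard EM for a GMM."""
    centers = list(initial_centers)  # copy so the caller's list is untouched
    assignments = []
    for _ in range(iterations):
        # Hard E-step: commit each point to its single nearest component,
        # instead of computing soft posteriors over all components.
        assignments = [
            min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
            for x in points
        ]
        # M-step: re-estimate each component mean from its assigned points.
        for k in range(len(centers)):
            members = [x for x, z in zip(points, assignments) if z == k]
            if members:
                centers[k] = sum(members) / len(members)
    return centers, assignments

points = [0.0, 0.2, 0.1, 5.0, 5.2, 5.1]
centers, assignments = kmeans_hard_em(points, [0.0, 1.0])
print(centers, assignments)
```

Replacing the hard E-step with soft posteriors gives standard EM for the same mixture model; the result still depends on the initial centers, since the objective is non-convex.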

With theoretical guarantees on the correctness 
of the posteriors from this step, the subsequent 
maximum likelihood estimation step could 
yield consistent parameter estimates. The consistency 
guarantees will largely depend on the 
amount of information that exists in the base feature 
functions about the latent states according to 
the L-PCFG model. 

8 Conclusion 

We presented a novel estimation algorithm for 
latent-variable PCFGs. This algorithm is based 
on clustering of continuous tree representations, 
and it leads to sparse grammar estimates and 
compact models. We also showed how to get a 
diverse set of parse tree predictions with this algorithm 
and with older spectral algorithms. Each prediction 
in the set is made by training an L-PCFG 
model after perturbing the underlying features that 
the estimation algorithm uses from the training data. 
We showed that such a diverse set of predictions 
can be used to improve parsing accuracy for English 
and German. 